## [1] 113937 81
The dataset contains 81 variables and 113K+ entries. We’ll keep only 14 columns and 10K entries for a reasonable processing time. Chosen columns should help achieve our goal.
## [1] 113937 81
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## [1] 10000 81
## [1] 10000 16
## 'data.frame': 10000 obs. of 16 variables:
## $ ProsperScore : num 3 7 NA NA NA 6 NA NA NA NA ...
## $ Term : int 36 36 36 36 36 36 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 2 2 3 3 4 3 3 3 3 ...
## $ BorrowerRate : num 0.285 0.205 0.11 0.111 0.2 ...
## $ LenderYield : num 0.275 0.1949 0.0999 0.1011 0.19 ...
## $ ListingCategory..numeric.: int 7 13 4 0 4 2 0 1 6 1 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 2 2 3 3 3 2 3 3 3 3 ...
## $ EmploymentStatusDuration : int 80 70 39 56 229 26 35 3 5 56 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 1 1 1 1 ...
## $ CurrentCreditLines : int 14 12 3 4 5 13 1 8 5 10 ...
## $ DebtToIncomeRatio : num 0.09 0.15 0.18 0.13 0.11 0.14 0.11 0.07 0.14 NA ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 5 6 4 5 5 5 4 3 4 1 ...
## $ StatedMonthlyIncome : num 5250 7500 3120 5598 4167 ...
## $ LoanOriginalAmount : int 2000 4500 9500 15000 2500 10000 2750 1000 7000 5000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1843 1394 574 304 608 1846 310 612 532 541 ...
## $ MonthlyLoanPayment : num 83.3 168.4 311 491.9 92.9 ...
# 11 is less risky
# transforming ProsperScore to factor
is.factor(pl_subsamp$ProsperScore)
## [1] FALSE
pl_subsamp$ProsperScore <- factor(pl_subsamp$ProsperScore, levels = c(1:11))
summary(pl_subsamp$ProsperScore)
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 93 474 701 1111 857 1057 898 1066 608 437 116 2582
qplot(x = ProsperScore, data = subset(pl_subsamp, !is.na(ProsperScore)),
xlab = "Prosper Score") +
scale_x_discrete(breaks = seq(1, 11, 1))
Most loans are of middle risk profiles, and low and high risks are almost equally served. It will be interesting to check how the risk and the borrower rate relate to each other.
pl_subsamp$Term <- factor(pl_subsamp$Term)
summary(pl_subsamp$Term)
## 12 36 60
## 142 7716 2142
qplot(x = Term, data = pl_subsamp, binwidth=1,
xlab = "Term in Months")
There are 3 loan terms: 1y, 3y, 5y. Term could be considered as a categorical variable. However, its ditribution shows that there are only 3 terms. So we chose to factorise this variable and consider it as ordinal.
# most of loans are current, very few are defaulted, some are charged off
summary(pl_subsamp$LoanStatus)
## Cancelled Chargedoff Completed
## 1 1056 3321
## Current Defaulted FinalPaymentInProgress
## 5026 406 15
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0 62 27
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 27 25 34
qplot(x = LoanStatus, data = pl_subsamp,
fill = LoanStatus, xlab = "Loan Status") +
scale_x_discrete(breaks = NULL)
count(pl_subsamp[pl_subsamp$LoanStatus == "Defaulted",]) / dim(pl_subsamp)[1]
## n
## 1 0.0406
Only 4.06% of loans have resulted in a default since 2005.
summary(pl_subsamp$BorrowerRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.1334 0.1815 0.1915 0.2498 0.4500
qplot(x = BorrowerRate*100, data = pl_subsamp,
binwidth = 1) +
scale_x_continuous(breaks = seq(0, 36, 2)) +
xlab("Borrower Rate, %")
qplot(x = BorrowerRate*100, data = pl_subsamp) +
scale_x_log10() +
xlab("Borrower Rate, %")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(x = BorrowerRate*100, data = pl_subsamp) +
scale_x_sqrt() +
xlab("Borrower Rate, %")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
ggplot(aes(x = BorrowerRate*100),
data = pl_subsamp) +
geom_density() +
xlab("Borrower Rate, %")
3/4 quantile of is ~25%. There is a peak at 31%; what could be the reason?
# a peak @ 31%
summary(pl_subsamp$LenderYield)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1234 0.1700 0.1815 0.2395 0.4325
qplot(x = LenderYield*100, data = pl_subsamp,
binwidth = 1, xlab = "Lender Yield %") +
scale_x_continuous(breaks = seq(0, 36, 2))
qplot(x = LenderYield, data = pl_subsamp) +
scale_x_log10()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot(x = LenderYield, data = pl_subsamp) +
scale_x_sqrt()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
It looks like BorrowerRate.
# peak for debt consolidation
pl_subsamp$ListingCategory <- pl_subsamp$ListingCategory..numeric.
pl_subsamp$ListingCategory <-
plyr::mapvalues(pl_subsamp$ListingCategory,
from = c(0:20),
to = c("Not Available", "Debt Consolidation", "Home Improvement",
"Business",
"Personal Loan", "Student Use", "Auto", "Other",
"Baby&Adoption", "Boat", "Cosmetic Procedure",
"Engagement Ring", "Green Loans", "Household Expenses",
"Large Purchases", "Medical/Dental", "Motorcycle", "RV",
"Taxes", "Vacation", "Wedding Loans"))
table(pl_subsamp$ListingCategory..numeric.)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1494 5122 645 663 200 65 217 917 23 3 6 20 8 180 78
## 15 16 17 18 19 20
## 122 21 2 78 70 66
table(pl_subsamp$ListingCategory)
##
## Auto Baby&Adoption Boat
## 217 23 3
## Business Cosmetic Procedure Debt Consolidation
## 663 6 5122
## Engagement Ring Green Loans Home Improvement
## 20 8 645
## Household Expenses Large Purchases Medical/Dental
## 180 78 122
## Motorcycle Not Available Other
## 21 1494 917
## Personal Loan RV Student Use
## 200 2 65
## Taxes Vacation Wedding Loans
## 78 70 66
pl_subsamp.backup.ListingCategory..numeric. <- pl_subsamp$ListingCategory..numeric.
pl_subsamp <- subset(pl_subsamp, select = -ListingCategory..numeric.)
pl_subsamp$ListingCategory <- factor(pl_subsamp$ListingCategory)
dim(pl_subsamp)
## [1] 10000 16
qplot(data = pl_subsamp, x = ListingCategory,
fill = ListingCategory, xlab = "Listing Category") +
scale_x_discrete(breaks = NULL)
with(pl_subsamp, sum(ListingCategory == "Debt Consolidation") /
length(ListingCategory))
## [1] 0.5122
~51% of loans are for debt consolidation.
# most of customers are employed
summary(pl_subsamp$EmploymentStatus)
## Employed Full-time Not available Not employed
## 190 5887 2355 452 69
## Other Part-time Retired Self-employed
## 364 98 73 512
qplot(x = EmploymentStatus,
data = subset(pl_subsamp, EmploymentStatus != ""),
fill = EmploymentStatus, xlab = "Employment Status") +
scale_x_discrete(breaks = NULL)
Employed borrowers are more likely to get a loan.
# half of customers have been working for less than 6 years
# right skewed distribution with a mean of 9 years
summary(pl_subsamp$EmploymentStatusDuration)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 25.00 67.00 94.95 135.00 634.00 644
qplot(x = EmploymentStatusDuration/12.,
data = subset(pl_subsamp, !is.na(EmploymentStatusDuration)),
binwidth = 1,
xlab = "Employment Status Duration, Years") +
scale_x_continuous(breaks = seq(0, 800/12., 4))
qplot(x = EmploymentStatusDuration/12.+1/365.,
data = subset(pl_subsamp, !is.na(EmploymentStatusDuration)),
binwidth = 1./12.,
xlab = "Employment Status Duration, Years") +
scale_x_log10()
qplot(x = EmploymentStatusDuration/12.+1/365.,
data = subset(pl_subsamp, !is.na(EmploymentStatusDuration)),
binwidth = 1./12.,
xlab = "Employment Status Duration, Years") +
scale_x_sqrt()
ggplot(aes(x = EmploymentStatusDuration/12.),
data = subset(pl_subsamp, !is.na(EmploymentStatusDuration))) +
geom_density() +
xlab("Employment Status Duration, Years")
3/4 quantile of employment duration is 11+ years, mean is ~8- years, median is 5.5 years and 1/4 quantile is ~2+ years. Distribution is log-normal with right skew and many borrowers are fresh workers.
summary(pl_subsamp$IsBorrowerHomeowner)
## False True
## 4910 5090
qplot(data = pl_subsamp, x = IsBorrowerHomeowner)
50.9% of customers own their home. Could this variable have any impact?
# normal distribution
summary(pl_subsamp$CurrentCreditLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.38 13.00 54.00 642
qplot(data = pl_subsamp, x = CurrentCreditLines,
binwidth = 1, xlab = "Current Credit Lines")
qplot(data = pl_subsamp, x = CurrentCreditLines,
binwidth = 0.01, xlab = "Current Credit Lines") +
scale_x_log10()
Half of customers have already 10 credit lines opened at the time of the listing. The distribution is normal with a right skew.
# skewed distribution to the right
# square root of DebtToIncomeRatio is normal
summary(pl_subsamp$DebtToIncomeRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1400 0.2200 0.2741 0.3100 10.0100 717
sum(subset(pl_subsamp, !is.na(DebtToIncomeRatio))$DebtToIncomeRatio>10)
## [1] 25
qplot(data = subset(pl_subsamp, !is.na(DebtToIncomeRatio)),
x = DebtToIncomeRatio,
binwidth = 0.02) +
xlim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
xlab("Debt To Income Ratio")
qplot(data = subset(pl_subsamp, !is.na(DebtToIncomeRatio)),
x = DebtToIncomeRatio,
binwidth = 0.15) +
scale_x_log10() +
xlab("Debt To Income Ratio")
qplot(data = subset(pl_subsamp, !is.na(DebtToIncomeRatio)),
x = DebtToIncomeRatio,
binwidth = 0.04) +
scale_x_sqrt() +
xlab("Debt To Income Ratio")
3/4 have a debt to income ratio less than 31%, with 25 exceeding 1000%.
# middle income tend to borrow more
# low inclome have difficult access to loans
# higher income borrow less
table(pl_subsamp$IncomeRange)
##
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 55 637 1548 2837 2778
## $75,000-99,999 Not displayed Not employed
## 1427 649 69
is.factor(pl_subsamp$IncomeRange)
## [1] TRUE
levels(pl_subsamp$IncomeRange)
## [1] "$0" "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
pl_subsamp.backup.IncomeRange <- pl_subsamp$IncomeRange
pl_subsamp$IncomeRange <-
factor(pl_subsamp$IncomeRange,
levels(pl_subsamp$IncomeRange)[c(7, 8, 1, 2, 4, 5, 6, 3)])
qplot(data = pl_subsamp, x = IncomeRange,
fill = IncomeRange, xlab = "Income Range") +
scale_x_discrete(breaks = NULL)
Most of loans are requested by middle-income customers, $25K-$75K.
summary(pl_subsamp$StatedMonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3250 4651 5609 6750 394400
qplot(x = StatedMonthlyIncome, data = pl_subsamp,
geom = "density") +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99))
## Warning: Removed 100 rows containing non-finite values (stat_density).
qplot(x = StatedMonthlyIncome, data = pl_subsamp,
binwidth = 500) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99))
qplot(x = StatedMonthlyIncome, data = pl_subsamp,
binwidth = 0.1) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99)) +
scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
A lognormal distribution.
# some peaks around 4k, 11k, 16k, 21k, 26k
# distribution like bi-modal at 4k and 16k
summary(pl_subsamp$LoanOriginalAmount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8419 12000 35000
qplot(data = pl_subsamp, x = LoanOriginalAmount,
binwidth = 1000, geom = "density",
xlab = "Loan Original Amount")
qplot(data = pl_subsamp, x = LoanOriginalAmount,
binwidth = 1000,
xlab = "Loan Original Amount") +
scale_x_continuous(breaks = seq(0, 35000, 5000))
qplot(data = pl_subsamp, x = LoanOriginalAmount,
binwidth = 5000,
xlab = "Loan Original Amount") +
scale_x_continuous(breaks = seq(0, 35000, 5000))
qplot(data = pl_subsamp, x = LoanOriginalAmount,
binwidth = 0.2,
xlab = "Loan Original Amount") +
scale_x_log10()
qplot(data = pl_subsamp, x = LoanOriginalAmount,
binwidth = 20,
xlab = "Loan Original Amount") +
scale_x_sqrt()
There are some peaks for the original amount at 4K, 11K and 16K. Mean is ~$8400.
str(pl_subsamp)
## 'data.frame': 10000 obs. of 16 variables:
## $ ProsperScore : Factor w/ 11 levels "1","2","3","4",..: 3 7 NA NA NA 6 NA NA NA NA ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 2 2 2 2 2 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 2 2 3 3 4 3 3 3 3 ...
## $ BorrowerRate : num 0.285 0.205 0.11 0.111 0.2 ...
## $ LenderYield : num 0.275 0.1949 0.0999 0.1011 0.19 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 2 2 3 3 3 2 3 3 3 3 ...
## $ EmploymentStatusDuration: int 80 70 39 56 229 26 35 3 5 56 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 1 1 1 1 ...
## $ CurrentCreditLines : int 14 12 3 4 5 13 1 8 5 10 ...
## $ DebtToIncomeRatio : num 0.09 0.15 0.18 0.13 0.11 0.14 0.11 0.07 0.14 NA ...
## $ IncomeRange : Factor w/ 8 levels "Not displayed",..: 6 7 5 6 6 6 5 8 5 3 ...
## $ StatedMonthlyIncome : num 5250 7500 3120 5598 4167 ...
## $ LoanOriginalAmount : int 2000 4500 9500 15000 2500 10000 2750 1000 7000 5000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1843 1394 574 304 608 1846 310 612 532 541 ...
## $ MonthlyLoanPayment : num 83.3 168.4 311 491.9 92.9 ...
## $ ListingCategory : Factor w/ 21 levels "Auto","Baby&Adoption",..: 15 10 16 14 16 9 14 6 1 6 ...
pl_subsamp$LoanOriginationDate <-
as.Date(pl_subsamp$LoanOriginationDate, "%Y-%m-%d")
str(pl_subsamp)
## 'data.frame': 10000 obs. of 16 variables:
## $ ProsperScore : Factor w/ 11 levels "1","2","3","4",..: 3 7 NA NA NA 6 NA NA NA NA ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 2 2 2 2 2 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 2 2 3 3 4 3 3 3 3 ...
## $ BorrowerRate : num 0.285 0.205 0.11 0.111 0.2 ...
## $ LenderYield : num 0.275 0.1949 0.0999 0.1011 0.19 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 2 2 3 3 3 2 3 3 3 3 ...
## $ EmploymentStatusDuration: int 80 70 39 56 229 26 35 3 5 56 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 1 1 1 1 ...
## $ CurrentCreditLines : int 14 12 3 4 5 13 1 8 5 10 ...
## $ DebtToIncomeRatio : num 0.09 0.15 0.18 0.13 0.11 0.14 0.11 0.07 0.14 NA ...
## $ IncomeRange : Factor w/ 8 levels "Not displayed",..: 6 7 5 6 6 6 5 8 5 3 ...
## $ StatedMonthlyIncome : num 5250 7500 3120 5598 4167 ...
## $ LoanOriginalAmount : int 2000 4500 9500 15000 2500 10000 2750 1000 7000 5000 ...
## $ LoanOriginationDate : Date, format: "2014-01-28" "2012-04-12" ...
## $ MonthlyLoanPayment : num 83.3 168.4 311 491.9 92.9 ...
## $ ListingCategory : Factor w/ 21 levels "Auto","Baby&Adoption",..: 15 10 16 14 16 9 14 6 1 6 ...
pl_subsamp.Dates <-
subset(pl_subsamp, select = c(LoanOriginationDate, BorrowerRate)) %>%
separate(LoanOriginationDate, c("Year", "Month", "Day"), sep = "-") %>%
dplyr::select(-Day)
pl_subsamp.Dates$YearCut <-
as.Date(cut(pl_subsamp$LoanOriginationDate, breaks = "year"))
pl_subsamp.Dates$MonthCut <-
as.Date(cut(pl_subsamp$LoanOriginationDate, breaks = "month"))
qplot(x = Year, data = pl_subsamp.Dates)
pl_subsamp.Dates %>%
dplyr::select(c(Month, Year)) %>%
group_by(Year) %>%
summarise(MinMonth = min(Month),
MaxMonth = max(Month))
## Source: local data frame [10 x 3]
##
## Year MinMonth MaxMonth
## 1 2005 11 11
## 2 2006 01 12
## 3 2007 01 12
## 4 2008 01 10
## 5 2009 05 12
## 6 2010 01 12
## 7 2011 01 12
## 8 2012 01 12
## 9 2013 01 12
## 10 2014 01 03
A drop in loans in 2009 followed by increasing loans thereafter. This trend can be explained by the continuous decrease in interest rates since the crisis. Data for 2005 and 2014 does not cover all the year, that’s why we don’t have too many observations for these years. No loans were granted between 11/2008 and 04/2009.
# lognormal distribution
summary(pl_subsamp$MonthlyLoanPayment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 134.8 223.0 275.7 372.7 1778.0
qplot(data = pl_subsamp, x = MonthlyLoanPayment,
binwidth = 150) +
xlab("Monthly Loan Payment")
qplot(data = pl_subsamp, x = MonthlyLoanPayment,
binwidth = 0.2) +
scale_x_log10() +
xlab("Monthly Loan Payment")
qplot(data = pl_subsamp, x = MonthlyLoanPayment,
binwidth = 4) +
scale_x_sqrt() +
xlab("Monthly Loan Payment")
The mean of monthly loans is $293.3.
# almost all fees are 1%
pl_subsamp$Fees <- with(pl_subsamp, BorrowerRate - LenderYield)
summary(pl_subsamp$Fees)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01000 0.01000 0.01009 0.01000 0.05000
with(pl_subsamp, cor.test(BorrowerRate, LenderYield))
##
## Pearson's product-moment correlation
##
## data: BorrowerRate and LenderYield
## t = 2536.2, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9991927 0.9992536
## sample estimates:
## cor
## 0.9992237
BorrowerRate and LenderYield are heavily correlated. We’ll keep only one of them.
pl_subsamp$MonthlyDebtAmount <-
with(pl_subsamp, DebtToIncomeRatio * StatedMonthlyIncome)
summary(pl_subsamp$MonthlyDebtAmount)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 645 1067 1222 1625 11150 717
qplot(x = MonthlyDebtAmount, data = pl_subsamp, binwidth = 100) +
scale_x_continuous(limits = c(0, 5000),
breaks = seq(0, 5000, 500)) +
xlab("Monthly Debt Amount")
qplot(x = MonthlyDebtAmount, data = pl_subsamp) +
scale_x_log10() +
xlab("Monthly Debt Amount")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Mean debt amount is ~$1200.
There are 100K+ observations and 81 variables. We decide to keep only 16. Following are the retained variables: ProsperScore, Term, LoanStatus, BorrowerRate, LenderYield, ListingCategory..numeric., EmploymentStatus, EmploymentStatusDuration, IsBorrowerHomeowner, CurrentCreditLines, DebtToIncomeRatio, IncomeRange, StatedMonthlyIncome, LoanOriginalAmount, LoanOriginationDate, MonthlyLoanPayment. ProsperScore is an oridnal variable with values going from 1 (higher risk) to 11 (lower risk). LoanStatus, EmploymentStatus, IsBorrowerHomeowner, ListingCategory are categorical variables. IncomeRange is an interval variable.
The size of the sample will be cut to 10K in order to speed up processing.
I would like to build a model that predicts how much a borrower would pay for a loan. The main feature is BorrowerRate.
I will test all the features in the dataset, and filter them at each step of the analysis. However, I ’ve decided to remove some of them. Here are the reasons: - MonthlyLoanPayment: it can be implied from the loan duration, the rate and amount - ListingCategory: all categories are related to consumption, thus I don’t think it could have an impact on rates - Fees: almost all fees are 1% - LenderYield: linear relationship with BorrowerRate
# deleting some features
pl_subsamp <- subset(pl_subsamp,
select = -c(MonthlyLoanPayment,
ListingCategory,
LenderYield, Fees))
dim(pl_subsamp)
## [1] 10000 14
LoanOriginalAmount have an unusual distribution. Indeed, although many amounts are in the range $1K-$10K, we observe peaks at 4K, 11K, 16K, 21K and 26K. Groupings the amount in chunks of 5K displays a more bell-like histogram.
with(pl_subsamp, by(BorrowerRate, ProsperScore, summary))
## ProsperScore: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2100 0.2999 0.3125 0.3042 0.3177 0.3500
## --------------------------------------------------------
## ProsperScore: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1400 0.2555 0.2800 0.2732 0.3096 0.3500
## --------------------------------------------------------
## ProsperScore: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1159 0.2139 0.2469 0.2471 0.2850 0.3500
## --------------------------------------------------------
## ProsperScore: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0769 0.1772 0.2099 0.2236 0.2669 0.3600
## --------------------------------------------------------
## ProsperScore: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0769 0.1685 0.2179 0.2281 0.3100 0.3500
## --------------------------------------------------------
## ProsperScore: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0716 0.1550 0.1939 0.2055 0.2575 0.3500
## --------------------------------------------------------
## ProsperScore: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0799 0.1359 0.1726 0.1839 0.2449 0.3500
## --------------------------------------------------------
## ProsperScore: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0609 0.1139 0.1449 0.1526 0.1774 0.3500
## --------------------------------------------------------
## ProsperScore: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0586 0.0930 0.1151 0.1242 0.1400 0.3500
## --------------------------------------------------------
## ProsperScore: 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04990 0.07160 0.08790 0.09765 0.11530 0.35000
## --------------------------------------------------------
## ProsperScore: 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06050 0.07160 0.09900 0.09699 0.11240 0.18750
qplot(x = ProsperScore, y = BorrowerRate,
data = subset(pl_subsamp, !is.na(ProsperScore)),
geom = "boxplot") +
xlab("Prosper Score") +
ylab("Rate")
qplot(x = ProsperScore, y = BorrowerRate,
data = subset(pl_subsamp, ProsperScore %in% c(10, 11)),
geom = "boxplot") +
xlab("Prosper Score by Term (M)") +
ylab("Rate") +
facet_wrap(~Term)
Inversely proportional. However, loans with scoring 11 have higher rates than those scoring 10, which is unintuitive. What could be the reason? The second plot helps us explain this phenomenon. Indeed, there are no loans with a score of 11 and a term of 12 months. 12 months loans happen to be those with the lowest rates.
with(pl_subsamp, by(BorrowerRate, Term, summary))
## Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.0929 0.1499 0.1534 0.2176 0.2669
## --------------------------------------------------------
## Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.1274 0.1799 0.1920 0.2575 0.4500
## --------------------------------------------------------
## Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0766 0.1485 0.1850 0.1925 0.2304 0.3304
qplot(x = Term, y = BorrowerRate,
data = pl_subsamp, geom = "boxplot") +
xlab("Term (M)") +
ylab("Rate")
1 year loans have the lower rate. 3 and 5 years loans have almost the same mean, with more volatility on 3 years loans. 3 and 5 years loans means are not very different. Maybe there is more demand on 3 years loans thus putting more pressure on 3 years rates.
with(pl_subsamp, by(BorrowerRate, LoanStatus, summary))
## LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2 0.2 0.2 0.2 0.2 0.2
## --------------------------------------------------------
## LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1700 0.2375 0.2331 0.2974 0.3600
## --------------------------------------------------------
## LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.1177 0.1700 0.1853 0.2500 0.3500
## --------------------------------------------------------
## LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0604 0.1314 0.1765 0.1835 0.2301 0.3304
## --------------------------------------------------------
## LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0600 0.1605 0.2192 0.2202 0.2750 0.4500
## --------------------------------------------------------
## LoanStatus: FinalPaymentInProgress
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0890 0.1232 0.1474 0.1742 0.2124 0.2958
## --------------------------------------------------------
## LoanStatus: Past Due (>120 days)
## NULL
## --------------------------------------------------------
## LoanStatus: Past Due (1-15 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0749 0.1665 0.2133 0.2133 0.2578 0.3177
## --------------------------------------------------------
## LoanStatus: Past Due (16-30 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0799 0.1789 0.2489 0.2294 0.2742 0.3177
## --------------------------------------------------------
## LoanStatus: Past Due (31-60 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0649 0.1760 0.2049 0.2152 0.2714 0.3304
## --------------------------------------------------------
## LoanStatus: Past Due (61-90 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1299 0.2148 0.2489 0.2475 0.3149 0.3177
## --------------------------------------------------------
## LoanStatus: Past Due (91-120 days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1269 0.1776 0.2298 0.2315 0.2766 0.3435
qplot(x = LoanStatus, y = BorrowerRate,
data = pl_subsamp, geom = "boxplot", fill = LoanStatus) +
scale_x_discrete(breaks = NULL) +
xlab("Loan Status") +
ylab("Rate")
Even if defaulted loans are among loans with the lowest rates, it does not necessarily mean that they are actually. One possibility, is that those loans were granted during a period of low rates. We’ll try to figure this out in the next section.
with(pl_subsamp, by(BorrowerRate, EmploymentStatus, summary))
## EmploymentStatus:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0295 0.1270 0.1775 0.1833 0.2375 0.4500
## --------------------------------------------------------
## EmploymentStatus: Employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1349 0.1819 0.1913 0.2489 0.3600
## --------------------------------------------------------
## EmploymentStatus: Full-time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1200 0.1700 0.1858 0.2457 0.3500
## --------------------------------------------------------
## EmploymentStatus: Not available
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.1400 0.1928 0.1926 0.2500 0.2900
## --------------------------------------------------------
## EmploymentStatus: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0495 0.1949 0.2511 0.2385 0.3058 0.3500
## --------------------------------------------------------
## EmploymentStatus: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0604 0.1568 0.2064 0.2122 0.2699 0.3304
## --------------------------------------------------------
## EmploymentStatus: Part-time
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1299 0.1662 0.1862 0.2430 0.3500
## --------------------------------------------------------
## EmploymentStatus: Retired
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0628 0.1147 0.1694 0.1821 0.2295 0.3500
## --------------------------------------------------------
## EmploymentStatus: Self-employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1435 0.1950 0.2040 0.2699 0.3500
qplot(x = EmploymentStatus, y = BorrowerRate,
data = subset(pl_subsamp, EmploymentStatus != ""),
geom = "boxplot", fill = EmploymentStatus) +
scale_x_discrete(breaks = NULL) +
xlab("Employment Status") +
ylab("Rate")
Full/Part-timers and retired pay lower rates. Non-employed pay higher rates.
ggplot(aes(x = EmploymentStatusDuration, y = BorrowerRate),
data = pl_subsamp) +
geom_point(alpha = 1/5) +
xlab("Employment Duration") +
ylab("Rate")
## Warning: Removed 644 rows containing missing values (geom_point).
with(pl_subsamp, cor.test(EmploymentStatusDuration, BorrowerRate))
##
## Pearson's product-moment correlation
##
## data: EmploymentStatusDuration and BorrowerRate
## t = -1.3703, df = 9354, p-value = 0.1706
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.034420102 0.006098648
## sample estimates:
## cor
## -0.01416654
We don’t observe any pattern. Moreover the correlation is almost non existent.
with(pl_subsamp, by(BorrowerRate, IsBorrowerHomeowner, summary))
## IsBorrowerHomeowner: False
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.1435 0.1950 0.2014 0.2599 0.4500
## --------------------------------------------------------
## IsBorrowerHomeowner: True
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1239 0.1699 0.1821 0.2346 0.3500
qplot(x = IsBorrowerHomeowner, y = BorrowerRate,
data = pl_subsamp, geom = "boxplot") +
xlab("Home Ownership") +
ylab("Rate")
Home owners pay lower rates in average. Probably because they bring more guarantees to the bank.
pl_subsamp.EmploymentStatusDurationCut <-
cut(pl_subsamp$EmploymentStatusDuration,
breaks = c(seq(0, 119, 12), seq(120, 599, 120), 634))
by(pl_subsamp$BorrowerRate, pl_subsamp.EmploymentStatusDurationCut, summary)
## pl_subsamp.EmploymentStatusDurationCut: (0,12]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0495 0.1310 0.1875 0.1953 0.2584 0.3600
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (12,24]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1277 0.1775 0.1891 0.2489 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (24,36]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1314 0.1899 0.1947 0.2566 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (36,48]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1314 0.1750 0.1918 0.2587 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (48,60]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0605 0.1435 0.1914 0.1999 0.2599 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (60,72]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0550 0.1325 0.1799 0.1892 0.2419 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (72,84]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1239 0.1768 0.1831 0.2344 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (84,96]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0520 0.1314 0.1832 0.1899 0.2414 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (96,108]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1350 0.1800 0.1924 0.2499 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (108,120]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0590 0.1270 0.1734 0.1872 0.2478 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (120,240]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1350 0.1817 0.1911 0.2489 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (240,360]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1355 0.1790 0.1902 0.2480 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (360,480]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0605 0.1242 0.1795 0.1865 0.2340 0.3500
## --------------------------------------------------------
## pl_subsamp.EmploymentStatusDurationCut: (480,634]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0759 0.1612 0.1849 0.2043 0.2662 0.3304
ggplot(aes(x = CurrentCreditLines, y = BorrowerRate),
data = subset(pl_subsamp, !is.na(CurrentCreditLines))) +
geom_point(alpha = 1/5, position = "jitter") +
xlab("Current Credit Lines") +
ylab("Rate")
with(pl_subsamp, cor.test(CurrentCreditLines, BorrowerRate))
##
## Pearson's product-moment correlation
##
## data: CurrentCreditLines and BorrowerRate
## t = -9.0742, df = 9356, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11344951 -0.07328034
## sample estimates:
## cor
## -0.09340294
There is no clear pattern.
ggplot(aes(x = DebtToIncomeRatio, y = BorrowerRate),
data = subset(pl_subsamp, !is.na(DebtToIncomeRatio))) +
geom_point(alpha = 1/5) +
coord_cartesian(xlim = c(0, 1)) +
xlab("Debt To Income Ratio") +
ylab("Rate")
with(pl_subsamp, cor.test(DebtToIncomeRatio, BorrowerRate))
##
## Pearson's product-moment correlation
##
## data: DebtToIncomeRatio and BorrowerRate
## t = 5.4205, df = 9281, p-value = 6.093e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03587453 0.07643211
## sample estimates:
## cor
## 0.0561765
The trend is not very marked, but there is a positive correlation between DebtToIncomeRatio and BorrowerRate.
with(pl_subsamp, by(BorrowerRate, IncomeRange, summary))
## IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0150 0.1350 0.1900 0.1897 0.2450 0.4500
## --------------------------------------------------------
## IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0495 0.1949 0.2511 0.2385 0.3058 0.3500
## --------------------------------------------------------
## IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1400 0.1700 0.1909 0.2444 0.3500
## --------------------------------------------------------
## IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1550 0.2130 0.2185 0.2862 0.3500
## --------------------------------------------------------
## IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1449 0.1995 0.2045 0.2629 0.3600
## --------------------------------------------------------
## IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1334 0.1800 0.1911 0.2492 0.3500
## --------------------------------------------------------
## IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0499 0.1249 0.1685 0.1798 0.2290 0.3500
## --------------------------------------------------------
## IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0565 0.1140 0.1520 0.1672 0.2085 0.3500
qplot(x = IncomeRange, y = BorrowerRate,
data = pl_subsamp, geom = "boxplot",
fill = IncomeRange) +
scale_x_discrete(breaks = NULL) +
xlab("Income Range") +
ylab("Rate")
The higher the income, the lower the rate. Suprisingly, many customers reporting $0 income have low rates.
ggplot(aes(x = StatedMonthlyIncome, y = BorrowerRate),
data = subset(pl_subsamp, !is.na(StatedMonthlyIncome))) +
geom_point(alpha = 1/5, na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99)) +
xlab("Stated Monthly Income") +
ylab("Rate") +
geom_smooth(method = "lm", color = "red", na.rm = TRUE)
with(pl_subsamp, cor.test(StatedMonthlyIncome, BorrowerRate))
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and BorrowerRate
## t = -11.988, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13831329 -0.09966838
## sample estimates:
## cor
## -0.1190359
The trend is not clear, but given the correlation, those reporting higher wages pay less rates.
ggplot(aes(x = LoanOriginalAmount / 1000, y = BorrowerRate),
data = pl_subsamp) +
geom_point(alpha = 1/5) +
xlab("Loan Original Amount, 1K") +
ylab("Rate") +
facet_wrap(~IncomeRange)
High-incomes have access to high amounts.
ggplot(aes(x = MonthCut, y = BorrowerRate),
data = pl_subsamp.Dates) +
stat_summary(fun.y = median, geom = "bar") +
scale_x_date(labels = date_format("%m-%Y"), breaks = "18 months") +
stat_smooth() +
xlab("Rate") +
ylab("Month")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
ggplot(aes(x = YearCut, y = BorrowerRate),
data = pl_subsamp.Dates) +
stat_summary(fun.y = median, geom = "bar") +
scale_x_date(labels = date_format("%Y"), breaks = "1 year") +
stat_smooth() +
xlab("Year") +
ylab("Rate")
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Rates reached a peak in 2010/2011 and a bottom in 2008/2014.
ggplot(aes(x = MonthlyDebtAmount, y = BorrowerRate),
data = subset(pl_subsamp, !is.na(MonthlyDebtAmount))) +
geom_point(alpha = 1/5, na.rm = TRUE) +
xlab("Monthly Debt Amount") +
ylab("Rate") +
xlim(0, quantile(pl_subsamp$MonthlyDebtAmount, 0.99, na.rm = TRUE))
There is no clear pattern in this visualisation.
BorrowerRate decreases with the risk profile. Lower the risk, lower is the rate. BorrowerRate increases with the loan’s term. BorrowerRate is low for full/part-time employees and retired, and high for non-employed. BorrowerRate is low for home owners. BorrowerRate is low for high-incomes, high for low-incomes. High-incomes have more access to credit. High-incomes can borrow higher amounts.
From the scatterplot matrix, we observe a correlation between: - Negative: StatedMonthlyIncome and DebtToIncomeRatio - Positive: StatedMonthlyIncome and LoanOriginalAmount - Positive: StatedMonthlyIncome and CurrentCreditLines - Positive: CurrentCreditLines and MonthlyDebtAmount
ggplot(aes(x = StatedMonthlyIncome, y = DebtToIncomeRatio),
data = pl_subsamp) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99, na.rm = TRUE)) +
ylim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
geom_point(alpha = 1/5, na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE) +
xlab("Monthly Income") + ylab("Debt Ratio")
with(pl_subsamp, cor.test(StatedMonthlyIncome, DebtToIncomeRatio))
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and DebtToIncomeRatio
## t = -10.713, df = 9281, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.13057458 -0.09038547
## sample estimates:
## cor
## -0.1105252
ggplot(aes(x = StatedMonthlyIncome, y = LoanOriginalAmount),
data = pl_subsamp) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99, na.rm = TRUE)) +
ylim(0, quantile(pl_subsamp$LoanOriginalAmount, 0.99, na.rm = TRUE)) +
geom_point(alpha = 1/5, na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE) +
xlab("Monthly Income") + ylab("Loan Amount")
with(pl_subsamp, cor.test(StatedMonthlyIncome, LoanOriginalAmount))
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and LoanOriginalAmount
## t = 25.708, df = 9998, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2305365 0.2673069
## sample estimates:
## cor
## 0.2490114
ggplot(aes(x = StatedMonthlyIncome, y = CurrentCreditLines),
data = pl_subsamp) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome, 0.99, na.rm = TRUE)) +
ylim(0, quantile(pl_subsamp$CurrentCreditLines, 0.99, na.rm = TRUE)) +
geom_point(alpha = 1/5) +
geom_smooth(method = "lm") +
xlab("Monthly Income") + ylab("Credit Lines")
## Warning: Removed 809 rows containing missing values (stat_smooth).
## Warning: Removed 809 rows containing missing values (geom_point).
with(pl_subsamp, cor.test(StatedMonthlyIncome, CurrentCreditLines))
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and CurrentCreditLines
## t = 17.091, df = 9356, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1542853 0.1935814
## sample estimates:
## cor
## 0.1740026
ggplot(aes(x = CurrentCreditLines, y = MonthlyDebtAmount),
data = pl_subsamp) +
xlim(0, quantile(pl_subsamp$CurrentCreditLines, 0.99, na.rm = TRUE)) +
ylim(0, quantile(pl_subsamp$MonthlyDebtAmount, 0.99, na.rm = TRUE)) +
geom_point(alpha = 1/5, na.rm = TRUE) +
geom_smooth(method = "lm", na.rm = TRUE) +
xlab("Credit Lines") + ylab("Monthly Debt Amount")
with(pl_subsamp, cor.test(CurrentCreditLines, MonthlyDebtAmount))
##
## Pearson's product-moment correlation
##
## data: CurrentCreditLines and MonthlyDebtAmount
## t = 62.237, df = 8648, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5414627 0.5705766
## sample estimates:
## cor
## 0.5561903
More you earn, more borrowing capacity you have. More you earn, more debt you accumulate.
The strongest relationship I found is between CurrentCreditLines and MonthlyDebtAmount. As you accumulate more debt, your monthly payments increase.
ggplot(aes(x = Term, y = BorrowerRate),
data = pl_subsamp) +
geom_boxplot() +
facet_wrap(~Year) +
xlab("Rate") + ylab("Term (M)")
Before 2010, there were only 3y loans. In 2010 appeared the first 5y loans. The year after appeared 1y loans. The 5y loans were attractive at the beginning as they offered lower rates than 3y loans. However they increased till surpassing 3y rates, which is more common.
qplot(x = LoanStatus, y = BorrowerRate,
data = pl_subsamp, geom = "boxplot",
fill = LoanStatus) +
scale_x_discrete(breaks = NULL) +
xlab("") + ylab("Term (M)") +
facet_wrap(~Year)
Unlike what we observed in the bivariate plots, it’s clear here that defaulted loans were borrowed at higher rates.
ggplot(aes(x = DebtToIncomeRatio, y = BorrowerRate, color = Year),
data = pl_subsamp) +
geom_point(na.rm = TRUE) +
xlab("Debt Ratio") + ylab("Rate") +
xlim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
scale_color_brewer(name = "Prosper Score", type = "div")
ggplot(aes(x = DebtToIncomeRatio, y = BorrowerRate, color = ProsperScore),
data = subset(pl_subsamp, !is.na(ProsperScore))) +
geom_point(na.rm = TRUE) +
xlab("Debt Ratio") + ylab("Rate") +
xlim(0, quantile(pl_subsamp$DebtToIncomeRatio, 0.99, na.rm = TRUE)) +
scale_color_brewer(name = "Prosper Score", type = "div") +
facet_wrap(~Year)
Less risky customers are offered lower rates.
# Coloring by ProsperScore
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
color = ProsperScore),
data = pl_subsamp) +
geom_point(na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99)) +
xlab("Monthly Income, 1K") +
ylab("Rate") +
scale_color_brewer(name = "Prosper Score", type = "div") +
facet_wrap(~Year)
# Coloring by Term
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
color = Term),
data = pl_subsamp) +
geom_point(na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
xlab("Monthly Income, 1K") +
ylab("Rate") +
scale_color_brewer(name = "Term", type = "div") +
facet_wrap(~Year)
# Coloring by DebtToIncomeRatio
summary(pl_subsamp$DebtToIncomeRatio)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1400 0.2200 0.2741 0.3100 10.0100 717
pl_subsamp.DebtToIncomeRatioCut <-
cut(pl_subsamp$DebtToIncomeRatio,
breaks = c(seq(0, 1, 0.2), 11))
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
color = pl_subsamp.DebtToIncomeRatioCut),
data = pl_subsamp) +
geom_point(na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
xlab("Monthly Income, 1K") +
ylab("Rate") +
scale_color_brewer(name = "Debt Ratio", type = "div") +
facet_wrap(~Year)
# Coloring by CurrentCreditLines
summary(pl_subsamp$CurrentCreditLines)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 10.00 10.38 13.00 54.00 642
pl_subsamp.CurrentCreditLinesCut <-
cut(pl_subsamp$CurrentCreditLines,
breaks = c(seq(0, 20, 5), 54))
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
color = pl_subsamp.CurrentCreditLinesCut),
data = pl_subsamp) +
geom_point(na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
xlab("Monthly Income, 1K") +
ylab("Rate") +
scale_color_brewer(name = "Credit Lines", type = "div") +
facet_wrap(~Year)
# Coloring by IsBorrowerHomeowner
ggplot(aes(x = StatedMonthlyIncome / 1000, y = BorrowerRate,
color = IsBorrowerHomeowner),
data = pl_subsamp) +
geom_point(na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$StatedMonthlyIncome/1000, 0.99, na.rm = TRUE)) +
xlab("Monthly Income, 1K") +
ylab("Rate") +
scale_color_brewer(name = "Home Owner", type = "div") +
facet_wrap(~Year)
ProsperScore is the only variable drawing a distinct frontier between different levels of rate. To explain the rate by other variables, we should imagine a combination of them. Non owners are mostly on the left of the plot; they have less income, but they could pay lower rates.
# Coloring by IncomeRange
ggplot(aes(x = MonthlyDebtAmount / 1000, y = BorrowerRate,
color = IncomeRange),
data = pl_subsamp) +
geom_point(alpha = 1/2, size = 2, position = "jitter", na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$MonthlyDebtAmount/1000, 0.99, na.rm = TRUE)) +
xlab("Monthly Debt, 1K") +
ylab("Rate") +
scale_color_brewer(name = "Monthly Income", type = "div") +
facet_wrap(~Year)
# DebtToIncomeRatio vs. BorrowerRate by Year
# Coloring by IncomeRange
ggplot(aes(x = DebtToIncomeRatio * 100, y = BorrowerRate,
color = IncomeRange),
data = pl_subsamp) +
geom_point(alpha = 1/2, size = 2, position = "jitter", na.rm = TRUE) +
xlim(0, quantile(pl_subsamp$DebtToIncomeRatio*100, 0.99, na.rm = TRUE)) +
xlab("Debt Ratio, %") +
ylab("Rate") +
scale_color_brewer(name = "Monthly Income", type = "div") +
facet_wrap(~Year)
The higher the income, the higher the debt, the less the rate. The higher the income, the less the debt ratio, the less the rate.
m1 <- lm(I(BorrowerRate) ~ I(as.numeric(as.character(Term))), data = pl_subsamp)
m2 <- update(m1, ~ . + DebtToIncomeRatio)
m3 <- update(m2, ~ . + StatedMonthlyIncome)
m4 <- update(m3, ~ . + as.logical(IsBorrowerHomeowner))
m5 <- update(m4, ~ . + as.numeric(as.character(Year)))
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))),
## data = pl_subsamp)
## m2: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) +
## DebtToIncomeRatio, data = pl_subsamp)
## m3: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) +
## DebtToIncomeRatio + StatedMonthlyIncome, data = pl_subsamp)
## m4: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) +
## DebtToIncomeRatio + StatedMonthlyIncome + as.logical(IsBorrowerHomeowner),
## data = pl_subsamp)
## m5: lm(formula = I(BorrowerRate) ~ I(as.numeric(as.character(Term))) +
## DebtToIncomeRatio + StatedMonthlyIncome + as.logical(IsBorrowerHomeowner) +
## as.numeric(as.character(Year)), data = pl_subsamp)
##
## =========================================================================================
## m1 m2 m3 m4 m5
## -----------------------------------------------------------------------------------------
## (Intercept) 0.185*** 0.180*** 0.188*** 0.193*** 1.246
## (0.003) (0.003) (0.003) (0.003) (0.639)
## I(as.numeric(as.character(Term))) 0.000* 0.000* 0.000** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## DebtToIncomeRatio 0.008*** 0.006*** 0.006*** 0.006***
## (0.001) (0.001) (0.001) (0.001)
## StatedMonthlyIncome -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000)
## as.logical(IsBorrowerHomeowner) -0.017*** -0.016***
## (0.002) (0.002)
## as.numeric(as.character(Year)) -0.001
## (0.000)
## -----------------------------------------------------------------------------------------
## R-squared 0.001 0.004 0.018 0.030 0.031
## adj. R-squared 0.000 0.004 0.018 0.030 0.030
## sigma 0.074 0.074 0.073 0.073 0.073
## F 5.333 17.911 57.352 72.752 58.754
## p 0.021 0.000 0.000 0.000 0.000
## Log-likelihood 11808.097 11046.887 11114.280 11172.349 11173.705
## Deviance 55.193 50.302 49.577 48.960 48.946
## AIC -23610.194 -22085.773 -22218.561 -22332.699 -22333.411
## BIC -23588.563 -22057.229 -22182.881 -22289.883 -22283.459
## N 10000 9283 9283 9283 9283
## =========================================================================================
# Model m3
standardLoan <- data.frame(Term = 36, DebtToIncomeRatio = 0.33,
StatedMonthlyIncome = 5000)
riskyLoan <- data.frame(Term = 60, DebtToIncomeRatio = 0.8,
StatedMonthlyIncome = 8000)
lowIncomeLoan <- data.frame(Term = 12, DebtToIncomeRatio = 0.2,
StatedMonthlyIncome = 2000)
extremeLoan <- data.frame(Term = 60, DebtToIncomeRatio = 10,
StatedMonthlyIncome = 5000)
modelSLEstimate <- predict(m3, newdata = standardLoan,
interval="prediction", level = .95)
print(modelSLEstimate)
## fit lwr upr
## 1 0.1898526 0.04656084 0.3331444
modelRLEstimate <- predict(m3, newdata = riskyLoan,
interval="prediction", level = .95)
print(modelRLEstimate)
## fit lwr upr
## 1 0.1930132 0.04968813 0.3363383
modelLILEstimate <- predict(m3, newdata = lowIncomeLoan,
interval="prediction", level = .95)
print(modelLILEstimate)
## fit lwr upr
## 1 0.1887025 0.04535051 0.3320545
modelELEstimate <- predict(m3, newdata = extremeLoan,
interval="prediction", level = .95)
print(modelELEstimate)
## fit lwr upr
## 1 0.2521724 0.106355 0.3979898
# Model m5
standardLoan <- data.frame(Term = 36, DebtToIncomeRatio = 0.33,
StatedMonthlyIncome = 5000, IsBorrowerHomeowner = TRUE,
Year = 2013)
riskyLoan <- data.frame(Term = 60, DebtToIncomeRatio = 0.8,
StatedMonthlyIncome = 8000, IsBorrowerHomeowner = FALSE,
Year = 2009)
lowIncomeLoan <- data.frame(Term = 12, DebtToIncomeRatio = 0.2,
StatedMonthlyIncome = 2000, IsBorrowerHomeowner = FALSE,
Year = 2013)
extremeLoan <- data.frame(Term = 60, DebtToIncomeRatio = 10,
StatedMonthlyIncome = 5000, IsBorrowerHomeowner = FALSE,
Year = 2009)
modelSLEstimate <- predict(m5, newdata = standardLoan,
interval="prediction", level = .95)
print(modelSLEstimate)
## fit lwr upr
## 1 0.1803261 0.03791799 0.3227342
modelRLEstimate <- predict(m5, newdata = riskyLoan,
interval="prediction", level = .95)
print(modelRLEstimate)
## fit lwr upr
## 1 0.2051351 0.0626807 0.3475895
modelLILEstimate <- predict(m5, newdata = lowIncomeLoan,
interval="prediction", level = .95)
print(modelLILEstimate)
## fit lwr upr
## 1 0.1926385 0.05015241 0.3351246
modelELEstimate <- predict(m5, newdata = extremeLoan,
interval="prediction", level = .95)
print(modelELEstimate)
## fit lwr upr
## 1 0.2679155 0.1229845 0.4128464
Given that rate change over time, we’ve plotted all visualisations by year. We attached a particular focus to StatedMonthlyIncome as it is an important variable when asking for a loan. ProsperScore was the only variable drawing a clear frontier between different rates. All other variables showed fuzzy visualisations and we couldn’t draw any conclusion. However,
The insteresting interaction came from defaulted loans. Indeed, back in the bivariate analysis, we found that loans that resulted in a default were among the loans with lower rates. Visualising by year showed that it’s not true. Most probably many defaults happened when rates were low. Central banks do lower rates in crisis periods in order to boost the ecenomy (low rates, more loans, more consumption and more investing, etc ..).
I’ve build 2 linear models: - 1st is based on Term, DebtToIncomeRatio, StatedMonthlyAmount - 2nd is based on IsBorrowerHomeowner, Year on top of model 1 Tested on 3 different profiles, they both give close values, ordered as expected: higher rate for risky, less for low income and less for standard profiles. The values of the predicted rate are close (all around 18%/19%) which is around the mean for those profiles. For a confidence interval of 95%, lower and upper values were also close. So this model is good to give an idea of what would be the average rate for a profile. For extreme cases, our extreme profile have a rate 6% to 8% higher than other profiles, and an lower/upper values of 12%/41% for a confidence interval of 95%, wihch is good enough.
When we plotted in the second section Rate vs. LoanStatus in box plots, we observed that the defaulted loan are having the smaller rates. Those risky loan were granted with confortable rates, which is non sense. Is the bank making a mistake in evaluating its borrowers? Our hypothesis was that most defaulted loans were granted in a period of economic trouble, where rates were systemically low. The plot shows that : - most defaults happended for loan issued between 2006-2008 - most loans were granted between 2012-1013 - mean rates for that periods were close These elements, give to some extent, some credit to our hypothesis.
DebtToIncomeRatio seem to be one important feature to look at before asking for a loan. Indeed, the less debt you have, the more leverage you have, and the less risk you carry, which end up with you paying low rates than your counterparts. However, when we look at this plot, we cannot draw any relationship between the debt ratio and the rate. There seems to be a small positive correlation (2011 is much bigger), but not enough to assert with conviction. On the other side, Faceting by ProsperScore, which is the risk property of the loan, we can see clearly that less risky loans pay smaller rates. Obviously, the debt ratio is not enough to measure the riskiness of a loan.
High-incomes may not have big debt ratios, but they may pay as much as low-incomes. On the plot, we can see high-incomes on the bottom-left triangle and low incomes on the top-right triangle (especially between 2012 and 2014).
So far, we’ve been trying to explain the borrower rate by some of the features provided in this data set. Armed with intuition and common sense, we thought that we’ll find some patterns and trends easily from the bi-variate plots. However, it turned out to be much tougher. Indeed, all patterns have been dug up in the multi-variate plots section. With the selected features, we’ve build a linear model that fit pretty well some profiles. But given the variance of rates for similar profiles, we’ll need to include more features in order to build a better model. This means that we’ll need to select other properties of the loans.